Introduction with an easy example

I will introduce tidytext following the book Text Mining with R. The full book is freely available online and is an interesting read.

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

text
## [1] "Because I could not stop for Death -"  
## [2] "He kindly stopped for me -"            
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"

Let's put the character vector we just created into a data frame, using the data_frame() function from dplyr (in current releases, data_frame() is deprecated in favour of tibble(), which works the same way here).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
text_df <- data_frame(line = 1:4, text = text)

text_df
## # A tibble: 4 x 2
##    line                                   text
##   <int>                                  <chr>
## 1     1   Because I could not stop for Death -
## 2     2             He kindly stopped for me -
## 3     3 The Carriage held but just Ourselves -
## 4     4                        and Immortality

We want to break this text apart into individual words. That is where tidytext comes in:

library(tidytext)

text_df %>%
  unnest_tokens(word, text)
## # A tibble: 20 x 2
##     line        word
##    <int>       <chr>
##  1     1     because
##  2     1           i
##  3     1       could
##  4     1         not
##  5     1        stop
##  6     1         for
##  7     1       death
##  8     2          he
##  9     2      kindly
## 10     2     stopped
## 11     2         for
## 12     2          me
## 13     3         the
## 14     3    carriage
## 15     3        held
## 16     3         but
## 17     3        just
## 18     3   ourselves
## 19     4         and
## 20     4 immortality

The two basic arguments to unnest_tokens used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case). Remember that text_df above has a column called text that contains the data of interest.

Having the text data in this format lets us manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2.

Tidying the work of Jane Austen

In the book, the examples use the novels of Jane Austen. The authors explain clearly how everything works and the examples are very good, so I will use the same example to make it easy to follow along.

library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()

original_books
## # A tibble: 73,422 x 4
##                     text                book linenumber chapter
##                    <chr>              <fctr>      <int>   <int>
##  1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
##  2                       Sense & Sensibility          2       0
##  3        by Jane Austen Sense & Sensibility          3       0
##  4                       Sense & Sensibility          4       0
##  5                (1811) Sense & Sensibility          5       0
##  6                       Sense & Sensibility          6       0
##  7                       Sense & Sensibility          7       0
##  8                       Sense & Sensibility          8       0
##  9                       Sense & Sensibility          9       0
## 10             CHAPTER 1 Sense & Sensibility         10       1
## # ... with 73,412 more rows

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format:

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
## # A tibble: 725,055 x 4
##                   book linenumber chapter        word
##                 <fctr>      <int>   <int>       <chr>
##  1 Sense & Sensibility          1       0       sense
##  2 Sense & Sensibility          1       0         and
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0          by
##  5 Sense & Sensibility          3       0        jane
##  6 Sense & Sensibility          3       0      austen
##  7 Sense & Sensibility          5       0        1811
##  8 Sense & Sensibility         10       1     chapter
##  9 Sense & Sensibility         10       1           1
## 10 Sense & Sensibility         13       1         the
## # ... with 725,045 more rows

The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
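For example, a quick sketch reusing the small text_df from above to show two of the other tokenizers:

```r
library(dplyr)
library(tidytext)

# tokenize into individual characters instead of words
text_df %>%
  unnest_tokens(character, text, token = "characters")

# or into sentences
text_df %>%
  unnest_tokens(sentence, text, token = "sentences")
```

The first column name is again the output column to create; the token argument picks the tokenizer.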

In an analysis, we often do not want to keep stop words: extremely common words such as "the", "of", and "to". We can remove them with an anti_join():

data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
## Joining, by = "word"

Now we will use the count() function from dplyr to find the most common words in the text:

tidy_books %>%
  count(word, sort = TRUE) 
## # A tibble: 13,914 x 2
##      word     n
##     <chr> <int>
##  1   miss  1855
##  2   time  1337
##  3  fanny   862
##  4   dear   822
##  5   lady   817
##  6    sir   806
##  7    day   797
##  8   emma   787
##  9 sister   727
## 10  house   699
## # ... with 13,904 more rows

If we want, we can use dplyr in combination with ggplot2 to immediately see the results in a graph:

library(ggplot2)
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Those are the very basics of tidytext. Word counts alone already tell you a lot about your text data, but of course there is a lot more you can do!

Analysis with text data

Sentiment analysis

Sentiment analysis is also known as opinion mining. When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words.

Sentiments inside the Jane Austen novels:
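The code that produced the sentiment plot was not echoed in this post; here is a sketch following the book's approach, joining with the bing lexicon and scoring the novels in chunks of 80 lines:

```r
library(tidyr)

# net sentiment (positive minus negative words) per 80-line section
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
```

Note that tidy_books above already had stop words removed; the bing lexicon contains almost no stop words, so the result is essentially the same as in the book.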

Most negative and positive words:
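A sketch following the book: count each word by sentiment and plot the top contributors on each side:

```r
# words that contribute most to positive and negative sentiment
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip()
```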


And of course, a wordcloud to show the words:
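The wordcloud code was hidden in this post as well; a sketch following the book, using the wordcloud and reshape2 packages:

```r
library(wordcloud)
library(reshape2)

# most common words overall
tidy_books %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

# comparison cloud of positive vs. negative words
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
```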


The following is for information only.

Analyzing word and document frequency: tf-idf

How do we quantify what a document is about? Can we do this by looking at the words that make up the document? One measure of how important a word may be is its term frequency (tf): how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a very sophisticated approach to adjusting term frequency for commonly used words.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites.
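Written out, the definition the book uses (and that bind_tf_idf() computes with the natural logarithm) is:

$$\mathrm{idf}(\mathrm{term}) = \ln\left(\frac{n_{\mathrm{documents}}}{n_{\mathrm{documents\ containing\ term}}}\right), \qquad \text{tf-idf} = \mathrm{tf} \times \mathrm{idf}$$

A word that appears in every document, like “the” in all six Austen novels, gets $\mathrm{idf} = \ln(6/6) = 0$, so its tf-idf is zero no matter how often it occurs.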

Tf
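The chunk that builds the book_words data frame was not shown in this post; following the book, it can be reconstructed like this:

```r
library(dplyr)
library(janeaustenr)
library(tidytext)

# count each word per book
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

# total number of words per book
total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)
```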

## Joining, by = "book"
## # A tibble: 40,379 x 4
##                 book  word     n  total
##               <fctr> <chr> <int>  <int>
##  1    Mansfield Park   the  6206 160460
##  2    Mansfield Park    to  5475 160460
##  3    Mansfield Park   and  5438 160460
##  4              Emma    to  5239 160996
##  5              Emma   the  5201 160996
##  6              Emma   and  4896 160996
##  7    Mansfield Park    of  4778 160460
##  8 Pride & Prejudice   the  4331 122204
##  9              Emma    of  4291 160996
## 10 Pride & Prejudice    to  4162 122204
## # ... with 40,369 more rows

There is one row in this book_words data frame for each word-book combination; n is the number of times that word is used in that book and total is the total words in that book. The usual suspects are here with the highest n, “the”, “and”, “to”, and so forth. Let’s look at the distribution of n/total for each novel, the number of times a word appears in a novel divided by the total number of terms (words) in that novel. This is exactly what term frequency is.

library(ggplot2)

ggplot(book_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE, bins = 30) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")

Tf-idf

book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words
## # A tibble: 40,379 x 7
##                 book  word     n  total         tf   idf tf_idf
##               <fctr> <chr> <int>  <int>      <dbl> <dbl>  <dbl>
##  1    Mansfield Park   the  6206 160460 0.03867631     0      0
##  2    Mansfield Park    to  5475 160460 0.03412065     0      0
##  3    Mansfield Park   and  5438 160460 0.03389007     0      0
##  4              Emma    to  5239 160996 0.03254118     0      0
##  5              Emma   the  5201 160996 0.03230515     0      0
##  6              Emma   and  4896 160996 0.03041069     0      0
##  7    Mansfield Park    of  4778 160460 0.02977689     0      0
##  8 Pride & Prejudice   the  4331 122204 0.03544074     0      0
##  9              Emma    of  4291 160996 0.02665284     0      0
## 10 Pride & Prejudice    to  4162 122204 0.03405780     0      0
## # ... with 40,369 more rows
book_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(book) %>% 
  top_n(5) %>% 
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf

Relationships between words: n-grams and correlations

Many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or which words tend to co-occur within the same documents.

library(dplyr)
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

austen_bigrams
## # A tibble: 725,049 x 2
##                   book          bigram
##                 <fctr>           <chr>
##  1 Sense & Sensibility       sense and
##  2 Sense & Sensibility and sensibility
##  3 Sense & Sensibility  sensibility by
##  4 Sense & Sensibility         by jane
##  5 Sense & Sensibility     jane austen
##  6 Sense & Sensibility     austen 1811
##  7 Sense & Sensibility    1811 chapter
##  8 Sense & Sensibility       chapter 1
##  9 Sense & Sensibility           1 the
## 10 Sense & Sensibility      the family
## # ... with 725,039 more rows
library(tidyr)

bigrams_separated <- austen_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts
## # A tibble: 33,421 x 3
##      word1     word2     n
##      <chr>     <chr> <int>
##  1     sir    thomas   287
##  2    miss  crawford   215
##  3 captain wentworth   170
##  4    miss woodhouse   162
##  5   frank churchill   132
##  6    lady   russell   118
##  7    lady   bertram   114
##  8     sir    walter   113
##  9    miss   fairfax   109
## 10 colonel   brandon   108
## # ... with 33,411 more rows

Topic modeling

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data; it finds natural groups of items even when we’re not sure what we’re looking for.

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.

Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.

  • Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”
  • Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. There are a number of existing implementations of this algorithm, and we’ll explore one of them in depth.

Example: the great library heist

titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
            "Pride and Prejudice", "Great Expectations")
library(gutenbergr)

books <- gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title")
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
library(stringr)

# divide into documents, each representing one chapter
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

# split into words
by_chapter_word <- by_chapter %>%
  unnest_tokens(word, text)

# find document-word counts
word_counts <- by_chapter_word %>%
  anti_join(stop_words) %>%
  count(document, word, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(topicmodels)
chapters_dtm <- word_counts %>%
  cast_dtm(document, word, n)

chapters_dtm
## <<DocumentTermMatrix (documents: 193, terms: 18215)>>
## Non-/sparse entries: 104722/3410773
## Sparsity           : 97%
## Maximal term length: 19
## Weighting          : term frequency (tf)
chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
chapters_lda
## A LDA_VEM topic model with 4 topics.
chapter_topics <- tidy(chapters_lda, matrix = "beta")
chapter_topics
## # A tibble: 72,860 x 3
##    topic    term         beta
##    <int>   <chr>        <dbl>
##  1     1     joe 1.436612e-17
##  2     2     joe 5.962111e-61
##  3     3     joe 9.881855e-25
##  4     4     joe 1.447329e-02
##  5     1   biddy 5.139275e-28
##  6     2   biddy 5.022015e-73
##  7     3   biddy 4.307280e-48
##  8     4   biddy 4.775557e-03
##  9     1 estella 2.431464e-06
## 10     2 estella 4.323253e-68
## # ... with 72,850 more rows
top_terms <- chapter_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms
## # A tibble: 20 x 3
##    topic      term        beta
##    <int>     <chr>       <dbl>
##  1     1 elizabeth 0.014101270
##  2     1     darcy 0.008810341
##  3     1      miss 0.008708777
##  4     1    bennet 0.006944344
##  5     1      jane 0.006494613
##  6     2   captain 0.015510635
##  7     2  nautilus 0.013051927
##  8     2       sea 0.008843483
##  9     2      nemo 0.008709651
## 10     2       ned 0.008031955
## 11     3    people 0.006785987
## 12     3  martians 0.006456394
## 13     3      time 0.005343667
## 14     3     black 0.005277449
## 15     3     night 0.004491174
## 16     4       joe 0.014473289
## 17     4      time 0.006852889
## 18     4       pip 0.006828209
## 19     4    looked 0.006366418
## 20     4      miss 0.006232761
library(ggplot2)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()